Plots can provide a useful visual summary of the data. Sometimes, a nice plot or two is all that is need for statistical analysis. In this document, we cover a basic overview of creating some plots in R.
Here’s a link to a more thorough coverage of plotting in R: https://r-graph-gallery.com/index.html.
ggplot2mapviewThe plotting functions introduced in this document have robust help
documentation with lots of options to customize your plots. If you want
to view help documentation for any of the functions used in this
document, run commands such?hist, ?plot,
?table, and so on.
To demonstrate how to create common statistical plots in R, we will
use the storms dataset which is located in the package
dplyr.
dplyr package should already be installed (if not
look for the link to install at the top of this window).dplyr.# load the library of function and data in dplyr
library(dplyr)
The package dplyr contains a dataset called
storms. Let’s check out the documentation for the data.
?storms # must load dplyr
# See a summary of all variables
summary(storms)
## name year month day
## Length:11859 Min. :1975 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:1992 1st Qu.: 8.000 1st Qu.: 8.00
## Mode :character Median :2002 Median : 9.000 Median :16.00
## Mean :2001 Mean : 8.785 Mean :15.83
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:24.00
## Max. :2020 Max. :12.000 Max. :31.00
##
## hour lat long status
## Min. : 0.000 Min. : 7.20 Min. :-109.30 Length:11859
## 1st Qu.: 6.000 1st Qu.:17.50 1st Qu.: -80.70 Class :character
## Median :12.000 Median :24.60 Median : -64.40 Mode :character
## Mean : 9.117 Mean :24.76 Mean : -64.09
## 3rd Qu.:18.000 3rd Qu.:31.30 3rd Qu.: -48.40
## Max. :23.000 Max. :51.90 Max. : -6.00
##
## category wind pressure tropicalstorm_force_diameter
## -1:2898 Min. : 10.00 Min. : 882 Min. : 0.0
## 0 :5347 1st Qu.: 35.00 1st Qu.: 985 1st Qu.: 60.0
## 1 :1934 Median : 45.00 Median : 999 Median :120.0
## 2 : 749 Mean : 53.64 Mean : 992 Mean :145.3
## 3 : 434 3rd Qu.: 65.00 3rd Qu.:1006 3rd Qu.:210.0
## 4 : 411 Max. :160.00 Max. :1022 Max. :870.0
## 5 : 86 NA's :6509
## hurricane_force_diameter
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 18.15
## 3rd Qu.: 25.00
## Max. :300.00
## NA's :6509
The hist function can be used create a histogram of a numerical
vector.
hist(storms$pressure,
xlab = "storm pressure (in millibars)",
main = "Distribution of Storm Pressure 1975-2020",
breaks = 10,
col = "aquamarine4")
Sometimes we may prefer to use a density plot over the histogram because the histogram is more sensitive to its options.
plot(density(storms$pressure),
xlab = "storm pressure (in millibars)",
main = "Distribution of Storm Pressure 1975-2020")
Boxplots are another useful plot for presenting the distribution of a quantitative variable using the five number summary.
boxplot(storms$pressure,
ylab = "storm pressure (in millibars)",
main = "Distribution of Storm Pressure 1975-2020")
boxplot(storms$pressure,
xlab = "storm pressure (in millibars)",
main = "Distribution of Storm Pressure 1975-2020",
horizontal = TRUE)
Qualitative variables may be entered as characters (such as
status of storm) or values (such as
category).
status is initially read as character
strings, and not as a factor.category is initially read as character
strings, and not as a factor.typeof(storms$category)
## [1] "integer"
typeof(storms$status)
## [1] "character"
plot()If we try to use the general plot() function, R will
give its best guess at which plot makes the most sense to display the
data.
plot()
will not work as we might expect.table() or convert the data
type to factor first if you prefer to use
plot().plot(storms$status)
table(),
prop.table() and barplot()The command table(x) will count the number of times a
value (or string of characters) occurs in x.
status.table <- table(storms$status)
status.table
##
## hurricane tropical depression tropical storm
## 3613 2898 5348
After creating a table, we can present the results visually in a bar chart.
barplot(status.table, main = "Distribution of Storm Status",
ylab = "Frequency",
col = "steelblue")
pie()Pie Charts can also be used to illustrate the distribution of one qualitative variable.
pie(status.table, #labels = status.table,
main = "Distribution of Storm Status")
The command prop.table([table_name]) will count the
proportion of all observations that a value (or string of characters)
occurs in the table named [table_name].
is.table(storms$status)
## [1] FALSE
is.table(status.table)
## [1] TRUE
prop.table(storms$status) will return an
error since storms$status is not a table.status.prop <- prop.table(status.table)
status.prop
##
## hurricane tropical depression tropical storm
## 0.3046631 0.2443714 0.4509655
barplot(status.prop, main = "Relative Frequency of Storm Status",
ylab = "Proportion",
col = "steelblue")
factor() and Using
plot()factor in
R.storms.fac <- storms # creates a second copy so we don't overwrite storms
storms.fac$status <- factor(storms$status) # convert status to factor
summary(storms.fac$status) # get new summary of factors
## hurricane tropical depression tropical storm
## 3613 2898 5348
status is converted to a factor, we
can plot the data using plot(). R assumes factors are best
displayed with a bar chart.plot(storms.fac$status,
main = "Distribution of Storms Status 1975-2020",
xlab = "Storm Status",
ylab = "Frequency",
col = "steelblue")
Imagine we would like to compare the wind speeds of tropical storms, hurricanes, and tropical depressions.
storms has variable
status stored as characters (not factors).# split data by storm status
hur <- filter(storms, status == "hurricane")
trop.storm <- filter(storms, status == "tropical storm")
trop.dep <- filter(storms, status == "tropical depression")
# Create side by side boxplot
boxplot(hur$wind, trop.storm$wind, trop.dep$wind,
main = "Windspeed of Storms (knots)",
names = c("Hurricanes", "Tropical Storms", "Tropical Depressions"),
col = c("red", "blue", "green"),
horizontal = TRUE)
plot()factor in
R.storms.fac <- storms # creates a second copy so we don't overwrite storms
storms.fac$status <- factor(storms$status) # convert status to factor
status is converted to a factor, R
will assume we want to plot each class of the factor on a separate plot.
Since wind is numeric, plot() will create a
boxplot for each category.plot(wind ~ status, data = storms.fac,
col = c("forestgreen", "blueviolet", "azure4"))
table(),
prop.table() and barplot()The command table(x) will count the number of times a
value (or string of characters) occurs in x.
con.table <- table(storms$status, storms$month)
con.table
##
## 1 4 5 6 7 8 9 10 11 12
## hurricane 5 0 0 18 100 776 1793 701 187 33
## tropical depression 2 0 40 169 349 717 1085 379 136 21
## tropical storm 23 13 50 187 474 1214 2016 913 387 71
After creating a table, we can present the results visually in a group bar chart.
barplot(con.table, beside = TRUE, # groups side-by-side
legend=rownames(con.table),
main = "Storm Types Each Month: 1975-2020",
xlab = "Month",
ylab = "Frequency")
beside = FALSE is the default.beside option, a stacked
barplot is created.barplot(con.table, # groups side-by-side
legend=rownames(con.table),
main = "Storm Types Each Month: 1975-2020",
xlab = "Month",
ylab = "Frequency")
table(), prop.table() and
barplot()table(x, y).prop.table([table_name]) to convert to
frequencies to proportions out of the grand total.con.prop <- prop.table(con.table) # contingency table percentage of all observations
con.prop
##
## 1 4 5 6
## hurricane 0.0004216207 0.0000000000 0.0000000000 0.0015178346
## tropical depression 0.0001686483 0.0000000000 0.0033729657 0.0142507800
## tropical storm 0.0019394553 0.0010962138 0.0042162071 0.0157686146
##
## 7 8 9 10
## hurricane 0.0084324142 0.0654355342 0.1511931866 0.0591112235
## tropical depression 0.0294291256 0.0604604098 0.0914916941 0.0319588498
## tropical storm 0.0399696433 0.1023695084 0.1699974703 0.0769879416
##
## 11 12
## hurricane 0.0157686146 0.0027826967
## tropical depression 0.0114680833 0.0017708070
## tropical storm 0.0326334430 0.0059870141
Often, we would like the proportions in the table to be computed out of the total in each row (instead of the grand total).
1 inside
prop.table().con.prop.row <- prop.table(con.table,1) # contingency table percentage row total
con.prop.row
##
## 1 4 5 6
## hurricane 0.0013838915 0.0000000000 0.0000000000 0.0049820094
## tropical depression 0.0006901311 0.0000000000 0.0138026225 0.0583160801
## tropical storm 0.0043006731 0.0024308153 0.0093492895 0.0349663426
##
## 7 8 9 10
## hurricane 0.0276778301 0.2147799613 0.4962634929 0.1940215887
## tropical depression 0.1204278813 0.2474120083 0.3743961353 0.1307798482
## tropical storm 0.0886312640 0.2270007479 0.3769633508 0.1707180254
##
## 11 12
## hurricane 0.0517575422 0.0091336839
## tropical depression 0.0469289165 0.0072463768
## tropical storm 0.0723635004 0.0132759910
Other times, we would like the proportions in the table to be computed out of the total in each column.
2 inside
prop.table().con.prop.col <- prop.table(con.table,2) # contingency table percentage column total
con.prop.col
##
## 1 4 5 6 7
## hurricane 0.16666667 0.00000000 0.00000000 0.04812834 0.10834236
## tropical depression 0.06666667 0.00000000 0.44444444 0.45187166 0.37811484
## tropical storm 0.76666667 1.00000000 0.55555556 0.50000000 0.51354280
##
## 8 9 10 11 12
## hurricane 0.28666420 0.36636698 0.35173106 0.26338028 0.26400000
## tropical depression 0.26486886 0.22170004 0.19016558 0.19154930 0.16800000
## tropical storm 0.44846694 0.41193298 0.45810336 0.54507042 0.56800000
Now we can create a stacked relative frequency barplot.
barplot(con.prop.col,
legend=rownames(con.prop.col),
main = "Storm Types Each Month: 1975-2020",
xlab = "Month",
ylab = "Proportion")
plot()factor in
R.storms.fac <- storms # creates a second copy so we don't overwrite storms
storms.fac$status <- factor(storms$status) # convert status to factor
storms.fac$month <- factor(storms$month) # convert month to factor
status and month are
converted to factors, the plot() function in R will
recognize the data types and generate a stacked barplot by default.plot(status ~ month, data = storms.fac,
col = c("forestgreen", "blueviolet", "azure4"))
Bivariate scatter plots can be used to identify the relationship between two quantitative variables.
plot(wind ~ pressure, data = storms,
main = "Relation of Pressure and Wind Speed of Storms",
xlab = "Pressure (in millibars)",
ylab = "Wind Speed (in knots)")
par(mfrow =c(n,m) creates an array of \(n\) rows and \(m\) columns.par(mfrow = c(1, 1)) if you want to go back to displaying
one plot per window.par(mfrow = c(2, 2)) # Create a 2 x 2 array of plots
# The next 5 plots created will be arranged in the array
boxplot(storms$wind) # create boxplot of wind speed
# Code below creates a histogram of wind speed
# We can add many options to customize
hist(storms$wind, xlab = "wind speed (in knots)", # x-axis label
ylab = "Frequency", # y-axis label
main = "Distribution of Storm Wind Speed 1975-2020", # main label
col = "steelblue") # change color of bars
plot(storms.fac$status, col = "gold") # plots status, which is categorical
plot(wind ~ pressure, data = storms) # plots two numerical variables
par(mfrow = c(1, 1)) # change settings so one image displayed in a window
# Compare numerical wind speed for different categories of storms
plot(wind ~ status, data = storms.fac, col = "springgreen4")
ggplot2The previous plots were created using R’s base graphics system.
A fancier alternative is to construct plots using the
ggplot2 package.
gg stands for Grammar of
Graphics.In its simplest form, to construct a (useful) plot in
ggplot2, you need to provide:
ggplot object.
ggplot via the first
data argument.geom_histogram for a histogram,
geom_point for a scatter plot, geom_density
for a density plot.x
variable, the y variable, which variable will control color
in the plots, etc.Help -> Cheat Sheets -> Data Visualization with ggplot2.ggplot2library(ggplot2) # make sure you have installed ggplot2 package
ggplot2ggplot(storms, aes(x = wind)) +
geom_histogram(fill = "steelblue", color="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(storms, aes(x = wind)) +
geom_density(color="red") +
theme_bw() # adding theme_bw() makes white background
# adding theme_bw() makes white background
ggplot(storms, aes(x = wind)) +
geom_boxplot(color="black", fill="blueviolet")
ggplot2ggplot(storms) +
geom_point(aes(x = pressure, y = wind))
In general, scaling is the process by which
ggplot2 maps variables to unique values. When this is done
for discrete or categorical variables, ggplot2 will often
scale the variable to distinct colors, symbols, or sizes, depending on
the aesthetic mapped.
In the example below, we map the status variable to the
color aesthetic, which is then scaled to different colors
for the different status levels.
ggplot(storms) +
geom_point(aes(x = pressure, y = wind, color = status))
Alternatively, we can map the status variable to the
shape aesthetic, which creates a plot with different shapes
for each observation based on the status level.
ggplot(storms) +
geom_point(aes(x = pressure, y = wind, shape = status))
We can even combine these two aesthetic mappings in a single plot to
get different colors and symbols for each level of month
and status, respectively.
ggplot(storms) +
geom_point(aes(x = pressure, y = wind, color = month, shape = status))
ggplot2Faceting creates separate panels (facets) of a data frame based on one or more faceting variables.
Below, we facet the data by the category.
ggplot(storms) +
geom_point(aes(x = pressure, y = wind)) +
facet_grid(~ category)
ggplot2# stacks bars on top of each other
ggplot(storms, aes(x=month)) +
geom_bar(aes(fill=status), stat = "count", position="stack") +
ggtitle("Occurrence of Storms by Month: 1975-2020")
# side by side groupings
ggplot(storms, aes(x=month)) +
geom_bar(aes(fill=status), stat = "count", position="dodge") +
ggtitle("Occurrence of Storms by Month: 1975-2020")
# stacks bars and standardizing each stack
ggplot(storms, aes(x=month)) +
geom_bar(aes(fill=status), stat = "count", position="fill") +
ggtitle("Occurrence of Storms by Month: 1975-2020")
mapviewlibrary(mapview) # load spatial mapping package
mapview(storms, xcol = "long", ycol = "lat",
zcol = "status",
crs = 4269, grid = FALSE)